Data Wrangling is the process of arranging your dataset into a workable format for analysis. When retrieving data from various sources, it is not always in a format that is ready to be analyzed. There are often missing or incorrect values in the dataset, which reduce the integrity of any analysis performed on it. Data Wrangling has been known to take up as much as 70% of a data scientist's time. In this notebook, we will review a few different formats in which you may retrieve data and some initial techniques to ensure that the data is ready for analysis.
One of the most common formats for datasets is the comma separated values, or csv, format. This format works well with both Microsoft Excel and Google Sheets, as both platforms allow you to view it as a spreadsheet. It is also supported directly by the pandas DataFrame. In the code below, we read a csv file into a DataFrame, df, and display the first 5 entries.
In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('Master.csv')
df.head(5)
Out[2]:
From the output, you can see that this dataset gives information about baseball players. Here are the last 10 entries of our DataFrame.
In [3]:
df.tail(10)
Out[3]:
One of the first things that you can do is analyze the DataFrame to see if anything looks incorrect, such as variation in the number of entries in each column, the min and max values, etc. This can be easily assessed through the describe() method.
In [4]:
df.describe()
Out[4]:
As we can see from this dataset, there is a lot of variation in the count across the columns. Some of this is due to the fact that there are baseball players in this dataset who are still alive today, so there is no information on their death date. Some of it can simply be due to incomplete information being provided. This is where Data Wrangling techniques must be applied so that we have a good dataset to work with.
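A quick way to quantify these gaps is isnull().sum(), which counts missing values per column. The sketch below uses a small made-up frame (the column names mirror Master.csv, but the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps, mimicking the Master.csv situation
# where living players have no death date.
players = pd.DataFrame({
    'playerID': ['aardsda01', 'aaronha01', 'abadan01'],
    'birthYear': [1981, 1934, 1972],
    'deathYear': [np.nan, np.nan, np.nan],
    'weight': [220.0, 180.0, np.nan],
})

missing = players.isnull().sum()  # missing values per column
print(missing)
```

Running the same call on df would show exactly which columns drive the count differences that describe() revealed.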
SQL queries can be used on DataFrames to extract the necessary data through SQLite syntax. The pandasql library provides the necessary API for running SQL queries. Below are a few examples. The first query retrieves the first 10 entries of the birthMonth and birthYear columns.
In [5]:
import pandasql
q = 'SELECT birthMonth, birthYear FROM df LIMIT 10'
sql_sol = pandasql.sqldf(q.lower(),globals())
sql_sol
Out[5]:
The query below retrieves data on players who weigh less than 200 lbs and groups the entries by birthCountry.
In [6]:
q2 = 'SELECT playerID,birthCountry,bats,weight FROM df WHERE weight < 200 GROUP BY birthCountry LIMIT 10'
sql_sol2 = pandasql.sqldf(q2.lower(),globals())
sql_sol2
Out[6]:
Using SQL queries makes it simple to retrieve the data that you would like to analyze, and then wrangle that data specifically if necessary.
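The same filtering can be done natively in pandas with boolean indexing, which is often faster since no SQLite round trip is needed. A minimal sketch, using a small hypothetical frame in place of Master.csv:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full Master.csv
df_small = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01'],
    'birthCountry': ['USA', 'CAN', 'USA'],
    'weight': [190, 210, 175],
})

# pandas equivalent of: SELECT playerID, weight FROM df WHERE weight < 200
light = df_small.loc[df_small['weight'] < 200, ['playerID', 'weight']]
print(light)
```

Which style to use is largely a matter of taste; pandasql is convenient when you already think in SQL.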
Another common format is JSON. Data frequently arrives in this format when you are working with web services and their REST APIs. The example below shows how one can retrieve data from a website's REST API and load the JSON response into a Python dictionary. We use the requests library to fetch the data and the json library to parse it. pprint is used to print the JSON in a more readable format.
In [7]:
import json
import requests
import pprint
url = 'http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=4beab33cc6d65b05800d51f5e83bde1b&artist=Cher&album=Believe&format=json'
data = requests.get(url).text
df3 = json.loads(data)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(df3)
Once the data is in the dictionary, we can access the information as such:
In [8]:
df3['album']['playcount']
Out[8]:
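The same nested lookups work on any parsed JSON, so you can experiment without hitting the network. The sketch below parses an inline string whose shape loosely imitates the album response above (the values are invented for illustration):

```python
import json

# Inline JSON roughly imitating the shape of the Last.fm album response
raw = '{"album": {"name": "Believe", "artist": "Cher", "playcount": "123456"}}'
album = json.loads(raw)

# json.loads turns JSON objects into nested dicts, so access chains like
# df3['album']['playcount'] are just ordinary dictionary indexing.
print(album['album']['playcount'])
```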
Returning to our baseball DataFrame, we can see that some entries are missing. There is a function called fillna that can be used to insert a value for the NaN entries. Here we insert a -1 for all NaN entries in the deathMonth column.
In [9]:
print(df['deathMonth'].head())
In [10]:
df['deathMonth'].fillna(-1).head()
Out[10]:
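Filling with a sentinel is one option; dropping incomplete rows with dropna is another. A small sketch on a made-up Series (the values here are illustrative, not from Master.csv) contrasts the two:

```python
import numpy as np
import pandas as pd

# Illustrative series with gaps, like the deathMonth column
s = pd.Series([4.0, np.nan, 11.0, np.nan])

filled = s.fillna(-1)   # keep every row, mark the gaps with a sentinel
dropped = s.dropna()    # or discard the incomplete rows entirely

print(filled.tolist())
print(len(dropped))
```

fillna preserves the row count (useful when rows must stay aligned with other columns), while dropna shrinks the dataset but leaves only genuine observations. Which is appropriate depends on the analysis.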